I have a friend who is fascinated by languages. Whenever we walk around and hear someone speaking, we try to guess their nationality from their accent. I thought it would be interesting to see whether we could use machine learning to classify and predict accents. Looking online for data, I came across the Speech Accent Archive (Weinberger, Steven. (2015). Speech Accent Archive. George Mason University. Retrieved from http://accent.gmu.edu). I then wrote a script to scrape the audio files, which I will show below.
# import necessary libraries
import requests
import numpy as np
from bs4 import BeautifulSoup
import os
from os.path import basename
import urllib.request
import subprocess
The Speech Accent Archive has 2869 audio samples from speakers of many different native languages. Every speaker was recorded saying the same excerpt so that the recordings can be compared directly. The excerpt is: "Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station."
#2869 samples
links = []
base_url = 'http://accent.gmu.edu/browse_language.php?function=detail&speakerid='
for i in range(2869):
    links.append(base_url + str(i + 1))
Fortunately, all the links share the same base URL, differing only in a speaker id ranging from 1 to 2869.
audiolist=[]
for j in range(len(links)):
    r = requests.get(links[j])
    raw_html = r.content
    soup = BeautifulSoup(raw_html, 'html.parser')
    a = soup.find_all('source')
    audiolist.append(a[0].get('src'))
Each page's HTML contains a source tag pointing to the audio file. To retrieve it we can use Beautiful Soup, a library for parsing HTML and extracting elements from it.
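As a minimal illustration of this extraction step, here is Beautiful Soup run on a made-up HTML snippet (the archive's real markup may differ, but the `find_all('source')` pattern is the same):

```python
from bs4 import BeautifulSoup

# hypothetical snippet resembling a speaker page's audio player
html = '''
<audio controls>
  <source src="http://accent.gmu.edu/soundtracks/arabic1.mp3" type="audio/mpeg">
</audio>
'''
soup = BeautifulSoup(html, 'html.parser')
src = soup.find_all('source')[0].get('src')
print(src)  # http://accent.gmu.edu/soundtracks/arabic1.mp3
```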
from IPython.display import Image
Image(filename="accentsarchive.png")
strip_audiolist = []
for k in range(len(audiolist)):
    strip_audiolist.append(''.join([i for i in basename(audiolist[k]) if not i.isdigit()]))
strip_audiolist = np.unique(strip_audiolist)
strip_audiolist = np.delete(strip_audiolist, 0)
c = []
for k in range(len(audiolist)):
    head, sep, tail = basename(audiolist[k]).partition('.')
    c.append(head)
d = []
for k in range(len(strip_audiolist)):
    head, sep, tail = strip_audiolist[k].partition('.')
    d.append(head)
for i in range(len(d)):
    os.mkdir('D:/AudioFiles/' + d[i])
Next, create folders where each audio file will be saved under its respective class (i.e. the speaker's native language).
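One caveat: `os.mkdir` raises an error if a folder already exists, so re-running the script fails partway through. `os.makedirs` with `exist_ok=True` avoids this. A small sketch, using a temporary directory as a stand-in for `D:/AudioFiles/` and made-up language names:

```python
import os
import tempfile

base = tempfile.mkdtemp()  # stand-in for 'D:/AudioFiles/'
for language in ['arabic', 'english', 'spanish']:
    # exist_ok=True makes this safe to run repeatedly
    os.makedirs(os.path.join(base, language), exist_ok=True)

print(sorted(os.listdir(base)))  # ['arabic', 'english', 'spanish']
```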
for k in range(len(audiolist)):
    for l in range(len(strip_audiolist)):
        if d[l] in audiolist[k]:
            urllib.request.urlretrieve(audiolist[k], 'D:/AudioFiles/' + d[l] + '/' + basename(audiolist[k]))
            subprocess.call(['ffmpeg', '-i', 'D:/AudioFiles/' + d[l] + '/' + basename(audiolist[k]),
                             'D:/AudioFiles/' + d[l] + '/' + c[k] + '.wav'])
from IPython.display import Image
Image(filename="intheend.png")
Using the urllib library we can retrieve the audio files and store them in their respective directories. I also converted the mp3 files to wav, since wav files are easier to read in Python and thus easier to analyze.
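One reason wav is convenient: a wav file is just a header plus raw PCM samples, so the samples can be loaded straight into a NumPy array. A self-contained sketch that writes a one-second sine tone with the standard-library `wave` module (into an in-memory buffer) and reads it back:

```python
import io
import wave
import numpy as np

# write a 1-second 440 Hz sine tone as 16-bit mono PCM into a buffer
sr = 22050
t = np.linspace(0, 1, sr, endpoint=False)
samples = (0.5 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)
buf = io.BytesIO()
with wave.open(buf, 'wb') as f:
    f.setnchannels(1)
    f.setsampwidth(2)  # 2 bytes = 16-bit samples
    f.setframerate(sr)
    f.writeframes(samples.tobytes())

# read it back as a NumPy array
buf.seek(0)
with wave.open(buf, 'rb') as f:
    y = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
print(y.shape)  # (22050,)
```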
from IPython.display import Image
Image(filename="arabic.png")
import pandas as pd
wav = []
dir = []
for i in range(len(d)):
    for filename in os.listdir('D:/AudioFiles/' + d[i]):
        if filename.endswith(".wav"):
            wav.append(filename)
            dir.append('D:/AudioFiles/' + d[i] + '/' + filename)
classes = []
for k in range(len(wav)):
    head, sep, tail = wav[k].partition('.')
    classes.append(head)
dups = []
for k in range(len(classes)):
    dups.append(''.join([i for i in classes[k] if not i.isdigit()]))
counts = pd.Series(dups).value_counts()
counts
It would be wise to discard languages with fewer than a certain number of samples: it is hard to train a model on many classes, especially classes with few examples. We can also see that the dataset is imbalanced, with English making up the majority of the records. To start, we can limit the number of classes to 8; once we have a decent model, we can try incorporating more classes. The languages that will be used are listed below.
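The thresholding step used below can be sketched on toy data (made-up counts, not the archive's real distribution):

```python
import pandas as pd

# toy imbalanced label list
labels = ['english'] * 5 + ['spanish'] * 3 + ['dutch'] * 1
counts = pd.Series(labels).value_counts()
kept = counts[counts > 2]   # keep only classes with more than 2 samples
print(list(kept.index))     # ['english', 'spanish']
```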
c = counts[counts>64]
c
reduced_dir = []
for i in range(len(dir)):
    for j in range(len(c.index)):
        if c.index[j] in dir[i]:
            reduced_dir.append(dir[i])
len(reduced_dir)
This reduces the dataset to 1486 audio files and 8 classes: english, spanish, arabic, mandarin, korean, french, russian, and portuguese.
Audio offers a vast number of possible features; there is an entire field dedicated to audio analysis, and fortunately there are also many Python libraries built for it.
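To give a flavour of what an audio "feature" is, here is one of the simplest, the zero-crossing rate, computed with plain NumPy on synthetic tones (librosa provides this and many more via functions like `librosa.feature.zero_crossing_rate`):

```python
import numpy as np

def zero_crossing_rate(y):
    """Fraction of consecutive sample pairs whose sign differs."""
    signs = np.sign(y)
    return np.mean(signs[1:] != signs[:-1])

sr = 22050
t = np.linspace(0, 1, sr, endpoint=False)
low = np.sin(2 * np.pi * 100 * t)    # 100 Hz tone: few zero crossings
high = np.sin(2 * np.pi * 1000 * t)  # 1000 Hz tone: many zero crossings
print(zero_crossing_rate(low) < zero_crossing_rate(high))  # True
```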
wav = []
dir = []
d = os.listdir('D:/AudioFiles/')
for i in range(len(d)):
    for filename in os.listdir('D:/AudioFiles/' + d[i]):
        if filename.endswith(".wav"):
            wav.append(filename)
            dir.append('D:/AudioFiles/' + d[i] + '/' + filename)
import librosa
import librosa.display
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
classes = []
for k in range(len(wav)):
    head, sep, tail = wav[k].partition('.')
    classes.append(head)
dups = []
for k in range(len(classes)):
    dups.append(''.join([i for i in classes[k] if not i.isdigit()]))
counts = pd.Series(dups).value_counts()
c = counts[counts>64]
reduced_dir = []
for i in range(len(dir)):
    for j in range(len(c.index)):
        if c.index[j] in dir[i]:
            reduced_dir.append(dir[i])
wavs = []
for j in range(len(reduced_dir)):
    wavs.append(basename(reduced_dir[j]))
classes = []
for k in range(len(wavs)):
    head, sep, tail = wavs[k].partition('.')
    classes.append(head)
dups = []
for k in range(len(classes)):
    dups.append(''.join([i for i in classes[k] if not i.isdigit()]))
There are many ways to visualize audio, such as spectrograms, chromagrams, waveplots, and tempograms. Below are the different visualizations for an Arabic-accent audio sample.
y, sr = librosa.load(reduced_dir[0])
plt.figure(figsize=(12, 8))
D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
plt.subplot(4, 2, 1)
librosa.display.specshow(D, y_axis='linear')
plt.colorbar(format='%+2.0f dB')
plt.title('Linear-frequency power spectrogram')
# Or on a logarithmic scale
plt.subplot(4, 2, 2)
librosa.display.specshow(D, y_axis='log')
plt.colorbar(format='%+2.0f dB')
plt.title('Log-frequency power spectrogram')
# Or use a CQT scale
CQT = librosa.amplitude_to_db(np.abs(librosa.cqt(y, sr=sr)), ref=np.max)
plt.subplot(4, 2, 3)
librosa.display.specshow(CQT, y_axis='cqt_note')
plt.colorbar(format='%+2.0f dB')
plt.title('Constant-Q power spectrogram (note)')
plt.subplot(4, 2, 4)
librosa.display.specshow(CQT, y_axis='cqt_hz')
plt.colorbar(format='%+2.0f dB')
plt.title('Constant-Q power spectrogram (Hz)')
# Draw a chromagram with pitch classes
C = librosa.feature.chroma_cqt(y=y, sr=sr)
plt.subplot(4, 2, 5)
librosa.display.specshow(C, y_axis='chroma')
plt.colorbar()
plt.title('Chromagram')
# Force a grayscale colormap (white -> black)
plt.subplot(4, 2, 6)
librosa.display.specshow(D, cmap='gray_r', y_axis='linear')
plt.colorbar(format='%+2.0f dB')
plt.title('Linear power spectrogram (grayscale)')
# Draw time markers automatically
plt.subplot(4, 2, 7)
librosa.display.specshow(D, x_axis='time', y_axis='log')
plt.colorbar(format='%+2.0f dB')
plt.title('Log power spectrogram')
# Draw a tempogram with BPM markers
plt.subplot(4, 2, 8)
Tgram = librosa.feature.tempogram(y=y, sr=sr)
librosa.display.specshow(Tgram, x_axis='time', y_axis='tempo')
plt.colorbar()
plt.title('Tempogram')
plt.tight_layout()
plt.show()
We can plot power spectrograms for each of the 8 classes to look for differences between classes. The plots show high intra-class variance (samples of the same accent look quite different), without obviously consistent differences between classes, which can make classification very tricky.
# Plot four example spectrograms per class, all with the same settings.
# The indices pick consecutive samples from each language's block of files.
examples = {'Arabic': [0, 1, 2, 3],
            'English': [176, 177, 178, 179],
            'French': [820, 821, 822, 823],
            'Korean': [890, 891, 892, 893],
            'Mandarin': [990, 991, 992, 993],
            'Russian': [1190, 1191, 1192, 1193],
            'Portuguese': [1150, 1151, 1152, 1153],
            'Spanish': [1290, 1291, 1292, 1293]}
for language, indices in examples.items():
    plt.figure(figsize=(12, 8))
    for plot_num, idx in enumerate(indices, start=1):
        y, sr = librosa.load(reduced_dir[idx], duration=20)
        Xdb = librosa.amplitude_to_db(abs(librosa.stft(y)))
        plt.subplot(4, 2, plot_num)
        librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='hz')
        plt.colorbar(format='%+2.0f dB')
        plt.title(language)
    plt.show()
Using librosa we can also extract mel-frequency cepstral coefficients (MFCCs). MFCCs are computed on a frequency scale that mimics human hearing, and they are commonly used in speech recognition applications. These MFCC values can be fed directly into a neural network.
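Note that the `mfcc.flatten()` step below turns the (coefficients × frames) matrix into one long row vector per clip, so clips of different lengths yield vectors of different lengths (the later `fillna(0)` pads the short ones). The normalize-and-flatten step can be sketched on a dummy matrix with made-up dimensions:

```python
import numpy as np

# dummy stand-in for an MFCC matrix: 20 coefficients x 50 frames
mfcc = np.random.randn(20, 50)
mfcc /= np.amax(np.absolute(mfcc))  # scale values into [-1, 1]
row = mfcc.flatten()                # one feature vector per clip
print(row.shape)                    # (1000,)
```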
mels = []
for filename in range(len(reduced_dir)):
    # load the audio time series and sampling rate, capped at 20 seconds
    y, sr = librosa.load(reduced_dir[filename], duration=20)
    mfcc = librosa.feature.mfcc(y=y, sr=sr)
    mfcc /= np.amax(np.absolute(mfcc))  # normalize the values
    mels.append(mfcc.flatten())
mfcc_df = pd.DataFrame(mels)
mfcc_df = mfcc_df.assign(label = pd.Series(dups).values)
# drop some lone subclasses; these were distinct dialects of the main languages
df = mfcc_df[mfcc_df.label != 'charapa-spanish']
df = df[df.label != 'haitiancreolefrench']
df = df.drop(df.index[1255])
df['label'] = pd.Categorical(df['label'])
df['label'] = df.label.cat.codes
df = df.fillna(0)
Since the classes are imbalanced, it would be wise to split the train and test sets proportionally. This can be done using scikit-learn's StratifiedShuffleSplit.
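On a toy imbalanced label set (made up here: 80% class 0, 20% class 1), a stratified split preserves the class proportions in both halves:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)  # imbalanced: 80% class 0

split = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(split.split(X, y))
print(np.mean(y[train_idx] == 0))  # 0.8
print(np.mean(y[test_idx] == 0))   # 0.8
```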
from sklearn.model_selection import StratifiedShuffleSplit
data = df.drop(columns=['label'])
labels = df.label
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, labels):
    X_train, X_test = data.iloc[train_index], data.iloc[test_index]
    y_train, y_test = labels.iloc[train_index], labels.iloc[test_index]
# check the proportions
labels.value_counts(normalize=True)
y_train.value_counts(normalize=True)
y_test.value_counts(normalize=True)
The proportions of the classes for the train and test set are all similar to the original dataset.
%reload_ext tensorboard
from tensorflow import keras
import tensorflow as tf
import datetime
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(1000, activation='relu', input_dim=np.shape(X_train)[1]),
    tf.keras.layers.Dense(600, activation='relu'),
    #tf.keras.layers.Dropout(0.5),
    #tf.keras.layers.Dense(300, activation='relu'),
    #tf.keras.layers.Dropout(0.5),
    #tf.keras.layers.Dense(800, activation='relu'),
    #tf.keras.layers.Dropout(0.5),
    #tf.keras.layers.Dense(500, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(50, activation='relu'),
    tf.keras.layers.Dense(8, activation='softmax')
])
model.compile(tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
log_dir="D:\\logs\\fit\\" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)
history = model.fit(x=X_train,
                    y=y_train,
                    epochs=20,
                    validation_data=(X_test, y_test),
                    callbacks=[tensorboard_callback])
#%tensorboard --logdir logs/fit
The accuracy curves show overfitting: validation accuracy does not improve as training accuracy rises. This means the model's hyperparameters need more tuning, or another form of input data needs to be used. Applied deep learning is a very empirical process, and we can do some trial and error using TensorBoard.
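One standard counter to this kind of overfitting is early stopping: halt training once validation loss stops improving for a few epochs (Keras provides this as `tf.keras.callbacks.EarlyStopping`). The patience logic behind it, sketched in plain Python on made-up loss values:

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop, or None."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0  # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:  # no improvement for `patience` epochs
                return epoch
    return None

# validation loss improves, then plateaus (made-up numbers)
losses = [1.9, 1.5, 1.2, 1.1, 1.15, 1.18, 1.2, 1.25]
print(early_stop_epoch(losses))  # 6
```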